Many missense substitutions are identified in single nucleotide polymorphism (SNP) data and large-scale random
mutagenesis projects. Each amino acid substitution potentially affects protein function. We have constructed a
tool that uses sequence homology to predict whether a substitution affects protein function. SIFT, which sorts
intolerant from tolerant substitutions, classifies substitutions as tolerated or deleterious. A higher proportion of
substitutions predicted to be deleterious by SIFT gives an affected phenotype than substitutions predicted to be
deleterious by substitution scoring matrices in three test cases. Using SIFT before mutagenesis studies could
reduce the number of functional assays required and yield a higher proportion of affected phenotypes. SIFT
may be used to identify plausible disease candidates among the SNPs that cause missense substitutions.



Identifying substitutions that affect protein function is
of major interest for those studying proteins and their
implications in disease. Disease-causing mutations
tend to occur in structurally and functionally important
sites, and a significant fraction of polymorphism
sites are located in these regions (Sunyaev et al. 2000).
It is estimated that each person is heterozygous for
24,000-40,000 amino acid-altering substitutions (Cargill
et al. 1999). Predicting substitutions at these sites as
deleterious or neutral may help identify disease-associated
alleles. A recent single nucleotide polymorphism
(SNP) study used an amino acid substitution
scoring matrix, BLOSUM62, to classify each amino acid
substitution caused by a SNP in a coding region as
conservative or nonconservative (Cargill et al. 1999).
However, use of a substitution scoring matrix may be
inappropriate for predicting whether an amino acid
substitution will affect a protein's function or structure
because it generalizes and does not incorporate
information specific to the protein of interest.

Substitution scoring matrices, such as BLOSUM62,
have not been tested against experimental data for
their ability to predict protein-altering substitutions.
The BLOSUM62 matrix, like most matrices, is intended
for database searching and pairwise alignment (Henikoff
and Henikoff 1992), which is a different task than
predicting deleterious substitutions. Substitution matrix
scores are typically calculated from a log odds ratio
of target frequencies, obtained by counting pairs of
aligned amino acids, with the background frequencies
of the amino acids. Substitutions to a more abundant
amino acid have a lower score relative to a less abundant
amino acid because the background frequency is
lower for the less abundant amino acid. However, the
overall abundance of an amino acid is irrelevant when
considering whether an amino acid change is tolerated.
On average, 14 out of the 19 possible substitutions
for a given amino acid have negative scores from
the BLOSUM62 matrix and are deemed nonconservative
by Cargill et al. (1999). If nonconservative substitutions
are predicted to be deleterious, then many substitutions
will be predicted to affect phenotype. However,
proteins actually contain many positions that
have a high degree of plasticity in accommodating
amino acid substitutions, as shown in previous mutagenesis
studies (Bowie and Sauer 1989; Climie et al.
1990; Huang et al. 1992; Markiewicz et al. 1994).
Therefore, experimentally testing all changes deemed
nonconservative by a substitution matrix would be
time-consuming and wasteful because of this overprediction,
especially for large-scale studies such as examination
of nonsynonymous SNPs (Lander 1996; Irizarry
et al. 2000) or in genome-wide random mutagenesis
projects (Bentley et al. 2000; Chen et al. 2000; McCallum
et al. 2000).
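
The dependence of a log-odds score on background abundance can be made concrete with a small sketch. It assumes the common convention in which an off-diagonal score is twice the base-2 logarithm of the observed pair frequency over the frequency expected by chance (half-bit units); the function name and all frequencies below are hypothetical and chosen only for illustration.

```python
import math


def log_odds_half_bits(q_ij: float, p_i: float, p_j: float) -> int:
    """Log-odds score for a pair of different residues i and j, in half bits.

    q_ij is the observed (target) frequency of the aligned pair, and
    2 * p_i * p_j is the pair frequency expected from background alone.
    """
    expected = 2.0 * p_i * p_j
    return round(2.0 * math.log2(q_ij / expected))


# Hypothetical frequencies, for illustration only.
q = 0.002        # same observed pair frequency in both cases
p_from = 0.05    # background frequency of the original residue
p_common = 0.09  # an abundant replacement residue
p_rare = 0.02    # a scarce replacement residue

score_to_common = log_odds_half_bits(q, p_from, p_common)
score_to_rare = log_odds_half_bits(q, p_from, p_rare)
```

With identical observed pair frequencies, the substitution to the abundant residue receives the lower score, purely because of its background frequency and not because of anything known about the position being substituted.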

Given a protein query, aligned sequences from the
protein's family give position-specific information,
which a substitution scoring matrix lacks. Residues
that are conserved completely in the protein family are
expected to be important for function, and even a
conservative substitution at one of these residues may
affect protein function. A substitution matrix may
underestimate the severity of deleterious substitutions
at these crucial positions. At some positions, any
amino acid change can be tolerated in the protein
if these positions are not involved in protein function
or structure. Because these are expected to be neutral
substitutions, one might expect amino acids in
these positions of a protein alignment to be diverse.
Therefore, the accuracy for predicting the phenotype
that results from an amino acid substitution based on
sequence alignment of protein family members should
be better than using a generalized substitution scoring
matrix.

SIFT is a sequence homology-based tool that sorts
intolerant from tolerant amino acid substitutions and
predicts whether an amino acid substitution at a particular
position in a protein will have a phenotypic
effect. SIFT predicts the phenotype resulting from a
substitution more accurately than substitution scoring
matrices for three data sets. In some exceptional cases,
a substitution is predicted by SIFT to be neutral but
experimentally does have a deleterious effect; these
can be accounted for by query-specific interactions
that are not conserved among the protein family members.

RESULTS 
Rationale 

SIFT takes a query sequence and uses multiple alignment
information to predict tolerated and deleterious
substitutions for every position of the query sequence.
SIFT is a multistep procedure that, given a protein
sequence, (1) searches for similar sequences, (2)
chooses closely related sequences that may share similar
function, (3) obtains the multiple alignment of
these chosen sequences, and (4) calculates normalized
probabilities for all possible substitutions at each position
from the alignment. Substitutions at each position
with normalized probabilities less than a chosen cutoff
are predicted to be deleterious; those greater than or
equal to the cutoff are predicted to be tolerated.
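
The final thresholding step can be sketched as follows. This is a deliberately simplified illustration and not the scoring used by SIFT: it assumes that the probabilities for one alignment column are plain observed frequencies scaled so that the most frequent residue has the value 1.0, and the cutoff passed in the usage example is a hypothetical value. The function names are likewise invented for this sketch.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def normalized_probabilities(column: str) -> dict[str, float]:
    """Observed residue frequencies in one alignment column, scaled so that
    the most frequent residue has the value 1.0. Gap characters are ignored.
    Assumption: no pseudocounts or sequence weighting are applied."""
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    if not counts:
        raise ValueError("column contains no amino acid residues")
    top = max(counts.values())
    return {aa: counts.get(aa, 0) / top for aa in AMINO_ACIDS}


def classify(column: str, cutoff: float) -> dict[str, str]:
    """Label every possible residue at this position: below the cutoff is
    predicted deleterious, at or above the cutoff is predicted tolerated."""
    probs = normalized_probabilities(column)
    return {
        aa: "deleterious" if p < cutoff else "tolerated"
        for aa, p in probs.items()
    }


# One hypothetical column from an alignment of ten sequences.
calls = classify("LLLLLLLLIV", cutoff=0.05)
```

In this toy column, leucine, isoleucine, and valine are called tolerated, whereas residues never observed at the position, such as alanine, fall below the cutoff and are called deleterious.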

To test the procedure against experimental data,
we chose unbiased data sets in which mutagenesis was
performed throughout the entire protein, and both
wild-type and negative phenotypes were assayed.
There were only three data sets that we could find in
the literature that fit the above criteria: LacI (Markiewicz
et al. 1994; Suckow et al. 1996), HIV-1 protease
(Loeb et al. 1989), and bacteriophage T4 lysozyme
(Rennell et al. 1991). The scarcity of unbiased data sets
indicates how difficult characterization of mutant proteins
on a large scale can be.

The goal of the prediction program is to distinguish
less severe but nonetheless affected phenotypes, as
well as null phenotypes, from wild-type. Therefore,
phenotypes that exhibited weakened activity in the
functional assays were grouped with loss-of-function
phenotypes. SIFT and substitution scoring matrices,
BLOSUM45, BLOSUM62, and BLOSUM80, were
tested for the ability to predict these substitutions as
deleterious. SIFT parameters used on the HIV-1 protease
and bacteriophage T4 lysozyme data sets were the
same as those determined to work well for the LacI
mutation data, so SIFT analysis can be generalized to
any protein for which homologous sequences are available.

Comparison of SIFT with BLOSUM62 Predictions on
LacI Mutation Data

LacI is a DNA-binding protein that normally represses
transcription of the lac operon. Upon binding of a
β-galactoside sugar inducer, LacI no longer binds to
DNA, thus allowing the organism to use lactose as an
energy source. Positions in the Escherichia coli lac repressor
gene were mutated individually to amber nonsense
codons (Markiewicz et al. 1994; Suckow et al.
1996). In each mutant, nonsense suppressor tRNAs
that would insert 1 of 13 different amino acids at the
engineered amber codon had been introduced so that
>4000 amino acid substitutions were analyzed. Using a
β-galactosidase colorimetric assay, each protein with a
single amino acid substitution had been tested for its
ability to (1) repress transcription at the lac operator
and (2) cease repression upon binding of IPTG, the
inducer sugar. More than 50% of the sites were generally
tolerant to substitutions, and the regions that were
sensitive to amino acid replacements were primarily at
the DNA and inducer binding sites and at the dimer
interface (Pace et al. 1997). We compared predictions
from SIFT and the substitution scoring matrices with
the resulting phenotypes from the substitutions examined
in the mutagenesis studies.

For SIFT to predict on LacI substitutions, it must
first select sequences related to the repressor. Combining
the results of sequences found in the SWISS-PROT/
TrEMBL 38 protein database (Bairoch and Apweiler
2000) and in the translated microbial genomes, SIFT
found 55 sequences similar to LacI. Those chosen from
SWISS-PROT/TrEMBL were annotated as belonging to
the LacI family of transcriptional regulators. Although
the chosen sequences are generally involved in transcriptional
repression relieved by an inducer, the
operators and inducers that interact with these proteins
are different from that of LacI. For example,
RBSR_ECOLI represses the ribose operon and relieves
repression by addition of ribose. Another selected sequence,
PURR_HAEIN, binds to the PUR operator in
the presence of guanine and loses affinity for the operator
without the corepressor. With this collection of
proteins, overall structure is expected to be conserved,
but not necessarily residues involved in binding DNA
or inducer.

The collection of LacI-related sequences was used
to measure the correlation between sequence conservation
and tolerance to substitutions. To predict
whether an amino acid substitution is deleterious
based on sequence homology, the degree of conservation
at a position should be correlated positively with
the number of deleterious substitutions at this position.
From information theory (Schneider et al. 1986),
conservation can be measured at each position and
ranges from zero bits at a position equally represented
by all 20 amino acids to 4.3 bits at an invariant position.
Strongly conserved positions are expected to be
unable to tolerate most substitutions, whereas weakly
conserved positions are expected to tolerate more substitutions
(see Fig. 1 for an example). Conservation was
calculated for each position using an alignment of the
55 chosen sequences. The Pearson correlation coefficient
between conservation and the number of deleterious
substitutions determined experimentally at each
position is 0.550. This is a conservative estimate because
proteins in the alignment bind to different inducers
and operators, so that positions important for
inducer and DNA binding may not necessarily be conserved
throughout the protein sequences. Also, the experimental
data contain only 12 or 13 substitutions at
each position, whereas up to 20 amino acids are represented
in the alignment. The high correlation between
experimental mutation data and conservation supports
the idea that we can predict from sequence data
whether a given substitution affects protein function
or structure.
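
The two endpoints of the conservation scale quoted above (zero bits and 4.3 bits) follow from measuring conservation as the maximum possible entropy, log2(20), minus the observed entropy of the column. The sketch below assumes exactly that definition, with no small-sample correction or sequence weighting; the function name is hypothetical.

```python
import math
from collections import Counter


def conservation_bits(column: str, alphabet_size: int = 20) -> float:
    """Conservation of one alignment column in bits: log2(alphabet size)
    minus the Shannon entropy of the observed residue frequencies.
    Assumption: raw frequencies, no small-sample correction."""
    counts = Counter(column)
    total = sum(counts.values())
    entropy = -sum(
        (n / total) * math.log2(n / total) for n in counts.values()
    )
    return math.log2(alphabet_size) - entropy


# An invariant column of 55 sequences, and a column in which all 20
# amino acids are equally represented.
invariant = conservation_bits("L" * 55)
uniform = conservation_bits("ACDEFGHIKLMNPQRSTVWY")
```

The invariant column gives log2(20), about 4.3 bits, and the uniform column gives zero bits. Per-position values computed this way are what would be correlated with the experimentally determined number of deleterious substitutions.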

SIFT made predictions from the LacI sequence
alignment (Fig. 2A) and showed higher total and experimental
prediction accuracy than BLOSUM62 (Fig.
2B), as summarized in Table 1. SIFT predicted 1747
out of the 2254 (78%) experimentally tolerated substitutions.

Figure 1 Sequence conservation corresponds to intolerant positions. (Top) [...]
representation (Schneider and Stephens 1990) of the LacI multiple alignment for positions [...],
a region involved in binding DNA. At each position, the stack of letters indicates which amino
acids appear in the alignment, and the total height of the stack is a measure of conservation.
(Bottom) Number of substitutions deleterious to LacI function at the corresponding positions
(Markiewicz et al. 1994; Suckow et al. 1996). Positions with high conservation, [...]
do not tolerate substitutions. Positions with low conservation, such as 26-28, [...]
substitutions. Positions 17 and 18 appear diverse in the alignment but cannot tolerate
substitutions. The side chains of these residues are involved in DNA-specific [...]
(Chuprina et al. 1993) that is not conserved among the paralogous sequences.

For substitutions with an affected phenotype, SIFT
correctly predicted 989 of 1750 (57%) of these.
Amino acid substitutions with BLOSUM62
scores ≥0 are classified as conservative substitutions
(Cargill et al. 2000) and occur more or as frequently
than expected by chance in a database of alignments;
these substitutions are predicted as tolerated. Substitutions
with negative scores are classified as nonconservative
changes (Cargill et al. 2000), and these changes
occur less frequently than expected by chance;
these substitutions are predicted as deleterious.

BLOSUM62 predicted 84% (1475/1750) of the deleterious
changes because many of its amino acid substitution
scores are negative (Fig. 2B, positions 1-50).
BLOSUM62 predicted only 31% of the tolerated substitutions
accurately and performed poorly in regions
that tolerate many substitutions (Fig. 2B, positions
101-150). This substitution scoring matrix alone did
not distinguish between conserved and variable positions,
mispredicting substitutions as deleterious at tolerant
positions. BLOSUM80 and BLOSUM45 were also
tested for prediction and performed poorly compared
to SIFT in a similar manner to
BLOSUM62 (data not shown).
Because SIFT uses sequence-specific
information, it can distinguish
between the conserved
and variable positions to get
better prediction performance.
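
The matrix-based rule described above reduces to the sign of a single score, independent of where in the protein the substitution occurs. The sketch below illustrates this; the handful of BLOSUM62 entries are quoted for illustration only and should be checked against the published matrix, and the dictionary and function names are hypothetical.

```python
# A few BLOSUM62 scores, quoted for illustration; verify against the
# published matrix before relying on them.
BLOSUM62_EXCERPT = {
    ("I", "L"): 2,
    ("D", "E"): 2,
    ("A", "G"): 0,
    ("L", "D"): -4,
    ("G", "W"): -2,
}


def matrix_prediction(score: int) -> str:
    """Scores of 0 or more are conservative and predicted tolerated;
    negative scores are nonconservative and predicted deleterious."""
    return "tolerated" if score >= 0 else "deleterious"


predictions = {
    pair: matrix_prediction(score)
    for pair, score in BLOSUM62_EXCERPT.items()
}
```

Because the lookup takes only the two residues as input, a leucine-to-aspartate change receives the same call at an invariant DNA-contacting residue as at a position that tolerates every tested substitution, which is the limitation the comparison with SIFT exposes.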

The total number of correctly predicted substitutions by
SIFT exceeds that of BLOSUM62 by 14% (Table 1, difference in
total prediction accuracies). Of substitutions predicted to be
deleterious by SIFT, 66% will yield a deleterious phenotype
experimentally by the β-galactosidase assay (Table 1, experimental
prediction accuracy). In comparison, only 49% of the
substitutions predicted to be deleterious by BLOSUM62 will
yield a deleterious phenotype experimentally. A higher proportion
of substitutions predicted to be deleterious will
give deleterious phenotypes experimentally if SIFT, rather
than BLOSUM62, is used for prediction. The number of substitutions
predicted to be deleterious is smaller for SIFT
(1496) than for BLOSUM62 (3033). Not only does SIFT predict
more accurately, but also
the number of predicted deleterious substitutions is
smaller. These numbers indicate that if SIFT-predicted
deleterious substitutions rather than BLOSUM62-predicted
deleterious substitutions are used as a guide
for conducting experiments on mutant proteins, then
(1) fewer experiments would have to be performed,
and (2) a higher proportion of the experiments will
yield affected phenotypes.
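
The percentages in this comparison can be recomputed from the counts given in the text. The sketch below assumes that total prediction accuracy is the fraction of all assayed substitutions predicted correctly and that experimental prediction accuracy is the fraction of predicted-deleterious substitutions that were affected experimentally; the function name is hypothetical, and the BLOSUM62 count of correctly predicted tolerated substitutions is derived from the reported totals rather than quoted.

```python
def summarize(n_tolerated, n_affected, tolerated_correct, affected_correct):
    """Derive the quantities discussed in the text from four counts."""
    false_positives = n_tolerated - tolerated_correct
    predicted_deleterious = affected_correct + false_positives
    total = n_tolerated + n_affected
    return {
        "predicted_deleterious": predicted_deleterious,
        "total_accuracy": (tolerated_correct + affected_correct) / total,
        "experimental_accuracy": affected_correct / predicted_deleterious,
    }


N_TOLERATED, N_AFFECTED = 2254, 1750

# SIFT: 1747 tolerated and 989 affected substitutions predicted correctly.
sift = summarize(N_TOLERATED, N_AFFECTED, 1747, 989)

# BLOSUM62: 1475 affected substitutions predicted correctly out of 3033
# predicted deleterious, so 3033 - 1475 tolerated ones were mispredicted.
blosum_tolerated_correct = N_TOLERATED - (3033 - 1475)
blosum62 = summarize(N_TOLERATED, N_AFFECTED, blosum_tolerated_correct, 1475)

difference = sift["total_accuracy"] - blosum62["total_accuracy"]
```

Under these definitions the counts reproduce the reported figures: 1496 substitutions predicted deleterious by SIFT, experimental prediction accuracies of about 66% and 49%, and a difference of about 14% in total prediction accuracy.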

Although SIFT does well at most positions, it
fails to predict deleterious substitutions involved in
LacI-specific recognition. There are 158 positions that cannot
tolerate six or more substitutions, yet SIFT predicted
56 of them to tolerate more than half of the
deleterious substitutions. The side chains at four of
these positions are involved in DNA-binding contacts
(Fig. 1, positions 17-18; Fig. 2A, double helices); the
side chains at nine other positions participate at the
dimer interface (Fig. 2A, double cylinders; Chuprina et
al. 1993; Bell and Lewis 2000). Other specific contacts
might involve IPTG binding, but these are unknown
because the structure solved for this complex had low
resolution so that side-chain interactions could not be
identified (Lewis et al. 1996). Nevertheless, of the 158
positions that do not tolerate six or more substitutions,
there are 31 positions (20%) where at least six of the
substituted proteins cannot respond to the inducer IPTG. If
the 56 positions that were mispredicted as tolerant to
substitutions were distributed randomly, then one
would expect approximately 11 (0.20 × 56) positions
to coincide with the positions sensitive to inducer. Instead,
20 positions (36%) were observed to coincide
with inducer-sensitive positions, indicating that many
SIFT mispredictions of intolerant positions to be tolerant
are due to lack of conservation in the alignment.
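
The expected-overlap argument above is simple arithmetic and can be reproduced directly. The sketch below also includes a hypergeometric tail probability as one possible way to quantify how unusual the observed overlap is; that calculation is an addition for illustration and is not an analysis reported in the text. Function names are hypothetical.

```python
from math import comb


def expected_overlap(n_positions: int, n_marked: int, n_drawn: int) -> float:
    """Expected number of marked positions among n_drawn positions chosen
    at random from n_positions, of which n_marked are marked."""
    return n_drawn * n_marked / n_positions


def hypergeom_tail(n_positions, n_marked, n_drawn, k_observed) -> float:
    """Probability of at least k_observed marked positions among the draw
    (illustrative only; not a test taken from the text)."""
    upper = min(n_marked, n_drawn)
    favourable = sum(
        comb(n_marked, k) * comb(n_positions - n_marked, n_drawn - k)
        for k in range(k_observed, upper + 1)
    )
    return favourable / comb(n_positions, n_drawn)


# 158 intolerant positions, 31 of them inducer-sensitive, 56 mispredicted
# as tolerant, 20 of which coincide with inducer-sensitive positions.
expected = expected_overlap(158, 31, 56)
observed_fraction = 20 / 56
tail_probability = hypergeom_tail(158, 31, 56, 20)
```

The expected value is about 11 positions, against 20 observed (36% of the 56 mispredicted positions), which matches the figures quoted in the paragraph.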



SIFT mispredicts when the alignment does not reflect
the constraints on the individual protein.

SIFT's prediction is based on paralogous sequences
in the LacI family. Although these sequences
share a similar function to LacI, they do not have the
same DNA operators or sugar inducers. Residues involved
directly in LacI repressor's function may not be
conserved throughout the alignment. Such positions

Figure 2 (A) SIFT predictions for substitutions in LacI. The effects of 12-13 substitutions at each position were assayed (Markiewicz et
al. 1994; Suckow et al. 1996). The number of substitutions above the X-axis are those that gave a wild-type phenotype; the number of
substitutions below the X-axis gave an affected phenotype. SIFT makes a prediction for every possible substitution, but only substitutions
predicted correctly by SIFT are depicted here and are colored in black. Gray bars above the X-axis indicate false positive error; these
substitutions were predicted to be deleterious by SIFT, when experimentally they gave wild-type phenotype. Gray bars below the X-axis
indicate false negative error; these substitutions were predicted to be neutral, but in fact gave an affected phenotype. Amino acid side
chains that have been identified as involved in interactions (Chuprina et al. 1993; Bell and Lewis 2000) are labeled as follows: (double
helix) those that interact with DNA, (double cylinders) those participating in the dimer interface. (Hexagons) Positions having six or more
substitutions that are unable to respond to the inducer (Markiewicz et al. 1994; Pace et al. 1997). Many of the intolerant positions that
were predicted to tolerate substitutions correspond to these query-specific positions. (Asterisks) Positions that can tolerate at least six
substitutions, but SIFT predicted more than half of these substitutions as deleterious. The consensus sequence and the original query
sequence, LACI_ECOLI, are shown. (B) BLOSUM62 prediction for substitutions in LacI for positions 1-50 and 101-150. BLOSUM62
performs well in the DNA-binding region (residues 1-50) because this region cannot tolerate many substitutions. However, in a region
that tolerates substitutions, such as positions 101-150, BLOSUM62 performs poorly, predicting many experimental false positives (large
gray bars above the X-axis).



will appear variable in an alignment of paralogous sequences
and cannot be identified as important from
sequence alone. The lack of conservation at these positions
leads SIFT to miss these intolerant positions.

There are well conserved positions in the alignment
that can tolerate substitutions according to the
β-galactosidase assay. A substitution occurring at one
of these positions will be predicted to affect protein
function, although experimentally it will have no effect;
this would be a false positive in a functional assay.
Interestingly, a majority of the positions with high
false positive error cluster at one face on the C-terminal
subdomain (red residues in Fig. 3). The structure of the
core tetramer does not implicate this face to be involved
in tetramerization (Friedman et al. 1995), and
other repressors in the alignment function as dimers.
Perhaps this C-terminal face is involved in an as yet
undiscovered interaction.

Comparison of SIFT with BLOSUM62 Predictions
on HIV-1 Protease Mutation Data

HIV-1 protease cleaves the gag and gag-pol polyproteins
into mature products and is therefore necessary for
virus maturation. HIV protease must recognize
[...] nonhomologous sites within the HIV polyprotein.
Loeb and his colleagues (1989) tested the effect
of [...] single missense mutations in HIV-1 protease.
Mutations were generated by random mutagenesis,
sequenced, and then scored for their ability to process
[...] precursor. Missense mutants were placed in one
of three categories: (1) wild-type, (2) intermediate, for
which both processed and unprocessed products were
observed, and (3) negative, for which no mature processed
products were produced by the protease. SIFT
and three substitution matrices from the BLOSUM series
were tested for their ability to predict substitutions
with intermediate and negative phenotypes as deleterious
and substitutions with wild-type phenotype as
tolerated.

The predictions returned by SIFT under default parameters
are more accurate than those of BLOSUM62
for HIV-1 protease (Table 1). Because the TrEMBL database
may contain mutant HIV-1 protease sequences,
which are not necessarily functional, sequences were
chosen from the SWISS-PROT database. Thirty-eight
proteases were chosen, with the most distantly related
sequence being 30% identical to the query sequence.
SIFT performed better than BLOSUM62 for predicting
both neutral and deleterious substitutions (Table 1).
Out of 215 substitutions predicted by SIFT to be deleterious,
85% give an affected phenotype (Table 1,
experimental prediction accuracy) using the protease
assay by Loeb et al. (1989).

Although the total prediction accuracy of SIFT exceeds
that of BLOSUM62 by 8% (Table 1), performance
can be improved further by basing predictions on an
alignment of sequences with similar substrate specificity.
The SIFT alignment contained protease sequences
from the Rous sarcoma virus (RSV) and avian myeloblastosis
virus (AMV), which differ from each other in
only one residue. Although their structures are very
similar to HIV protease (Wlodawer et al. 1989), AMV
has been shown to have substrate specificity distinct
from human HIV protease (Tomasselli et al. 1990).
Also, the SIFT alignment of RSV and AMV with HIV-1
protease did not match the structural alignment (Wlodawer
et al. 1989) at some positions. These specificity
differences and misalignments may have reduced SIFT
performance. Therefore, RSV and AMV protease sequences
were removed, so that the remaining 36 sequences
in the alignment are proteases from humans
and simians. SIV protease has substrates homologous
to HIV protease and has been shown to cleave HIV-1
polyprotein substrate in a manner similar to HIV-1
(Grant et al. 1991). Thus, prediction based on this
alignment should not be confounded by substrate-specific
residues as much as prediction based on the
alignment containing RSV and AMV protease sequences.
Indeed, SIFT performance based on the
alignment without RSV and AMV protease sequences
was 3% better than SIFT performance on the alignment
with these sequences (Table 1). BLOSUM80 and
BLOSUM45 were also tested for prediction and performed
poorly compared with SIFT (data not shown).
The prediction accuracy for deleterious substitutions
increased when AMV and RSV proteases were excluded
because residues important for substrate specificity
may be conserved in the alignment of human and simian
viral proteases. Prediction for neutral substitutions
decreases only slightly, which indicates that the remaining
protease sequences are diverse enough for prediction.

Figure 3 (A) Structure of LacI as a homodimer (light and dark blue strands) with DNA (yellow strand). The N-terminal subdomain, whose
interface is important for DNA binding and the allosteric mechanism, is at the upper part of the figure; the C-terminal domain is at the
bottom. The 186 positions tolerant for six or more substitutions are colored in white on one monomer (Markiewicz et al. 1994; Suckow
et al. 1996). For 31 of these positions, >50% of the substitutions were predicted to affect phenotype according to SIFT when
experimentally they did not (see also Fig. 2, asterisks). These positions are shown as space-fill atoms in red. Noticeably, many of these
occurred at the bottom face of the C-terminal domain. This structure is 1EFA from PDB (Bell and Lewis 2000). (B) Same figure rotated 90°
about the Z-axis.

We examined the literature to account for mispredictions
at some positions. Several intolerant positions
that SIFT predicted to tolerate substitutions cluster together
in region 35-40. Residues 36-46 show large
structural deviations and are implicated in HIV protease
adaptation to binding of the substrates (Prabu-Jeyabalan
et al. 2000), so that errors at these residues
could be accounted for by substrate specificity. In general,
SIFT predicts better than substitution matrices on
HIV-1 protease mutation data; with careful selection of
sequences and comparison of structures, performance
can be improved further.

Comparison of SIFT with BLOSUM62 Predictions
on Bacteriophage T4 Lysozyme Mutation Data

The final test case, which uses mutation data from bacteriophage
T4 lysozyme, shows that SIFT can improve
prediction remarkably when only one homologous sequence
is available. Bacteriophage T4 produces a
soluble lysozyme that breaks up bacterial cell walls late
in the infection of E. coli. Bacteriophage T4 lysozyme
was subjected to a mutagenesis study using amber suppressor
tRNAs (Rennell et al. 1991). Similar to the LacI
results, approximately half of the positions could tolerate
all tested substitutions. Lysozyme function was
assayed by plaque formation, and mutants were scored
by plaque size. Mutants with plaques the same size as
wild type were scored as wild-type. Intermediate phenotypes
were scored for mutants with smaller plaque
size. Mutants that produced no plaques were scored as
null. We tested whether SIFT could predict a mutant
with a wild-type phenotype as tolerated, and mutants
with either intermediate or null phenotypes as deleterious.

When using the automated procedure for choosing
similar proteins, the lysozyme amino acid sequence
was unable to meet SIFT's criteria for choosing
sequences. An error was returned to the user,
indicating that there were not enough sequences and
the user should examine the results manually. The
SIFT alignment had gaps occurring in regions corresponding
to secondary structure and in a core region
that is conserved among distant proteins (Monzingo et
al. 1996). Only VG05_BPT4, a tail-associated lysozyme
in bacteriophage T4, aligned well with bacteriophage
T4 soluble lysozyme (43% identity, 3% gaps). This protein
is similar in function to the soluble lysozyme
because a tail-associated lysozyme mutant can substitute
for it (Kao and McClain 1980). The biological evidence
and the global pairwise alignment support
VG05_BPT4 as a good candidate for SIFT prediction
on bacteriophage T4 lysozyme.
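
Identity and gap percentages such as those quoted for the pairwise alignment can be computed from two aligned strings. The sketch below is one plausible definition, stated as an assumption: identity is counted over columns where both sequences have a residue, and the gap fraction is counted over all alignment columns. The text does not say which convention was used, and the two short sequences are invented for illustration.

```python
def pairwise_stats(aligned_a: str, aligned_b: str) -> tuple[float, float]:
    """Return (fraction identical, fraction gapped) for two aligned sequences.

    Assumptions: '-' marks a gap, identity is measured over columns with a
    residue in both sequences, and gaps are measured over all columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = list(zip(aligned_a, aligned_b))
    gapped = sum(1 for x, y in columns if x == "-" or y == "-")
    aligned = len(columns) - gapped
    if aligned == 0:
        raise ValueError("no columns with residues in both sequences")
    identical = sum(1 for x, y in columns if x == y and x != "-")
    return identical / aligned, gapped / len(columns)


# Invented toy alignment, not real lysozyme sequence.
identity, gap_fraction = pairwise_stats("MKT-AYIAK", "MRTLAY-AK")
```

For the toy pair, six of the seven ungapped columns match and two of the nine columns contain a gap.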

With sequence information from just the soluble
lysozyme query and VG05_BPT4, SIFT yields better
prediction results than BLOSUM62. Twice as many
neutral substitutions are predicted correctly when
compared with BLOSUM62 (59% vs 30%), with a 13%
reduction in predicting deleterious substitutions, so
that total prediction accuracy was 25% higher (Table
1). SIFT also performed better than BLOSUM80 and
BLOSUM45 (data not shown). There are many tolerant
positions predicted intolerant, presumably due to basing
the prediction on only two sequences. Some intolerant
positions predicted incorrectly to tolerate substitutions
may be residues that specifically recognize the
bacterial cell wall composition, because the soluble
lysozyme destroys bacterial cell walls from the inside,
whereas tail-associated lysozyme recognizes cell walls
from the outside (Mosig et al. 1989; Nakagawa et al.
1985). The performance on this mutation data set
shows that additional information from just a single
homologous sequence can yield better prediction results
than predictions from substitution matrices.

DISCUSSION

SIFT is a novel tool that incorporates position-specific
information by using sequence alignment and is intended
specifically for predicting whether an amino
acid substitution affects protein function. For all three
test cases, SIFT had a higher number of correctly predicted
substitutions than the substitution scoring matrices.
Moreover, a higher proportion of substitutions
predicted to be deleterious by SIFT had affected phenotypes
in the experimental assays than substitutions
predicted to be deleterious by substitution matrices.
For all of the data sets, SIFT made fewer mispredictions
than the substitution matrices that a substitution
was deleterious when it was tolerated experimentally.
For two out of the three data sets, SIFT missed more
deleterious substitutions than the substitution scoring
matrices. Some of these errors were accounted for by
query-specific interactions that are not conserved in
the family.

SIFT bases its predictions on sequence data alone
and does not depend on knowledge of protein structure
or function. Substitutions in uncharacterized proteins
can be evaluated by SIFT only when homologous
sequences are available. Although SIFT can
choose sequences automatically, better prediction results
are obtained when a list of homologs is provided,
as seen with the HIV-protease mutation data. The ideal
set of sequences for SIFT prediction is well-aligned
orthologous sequences. Paralogs with distinct biochemical
functions will confound prediction at residues
conserved only among the orthologs. However, as
protein databases grow with data from whole genome
sequencing, a larger number of orthologs will become
available, and SIFT prediction should become more
accurate.

Surprisingly, few sequences are needed by SIFT to
observe improvement of prediction over a substitution
scoring matrix. In the case of lysozyme, we observed
that with only one sequence homologous to the test
[...], SIFT prediction is significantly better than using
a generalized substitution scoring matrix for prediction.
This indicates that with only a single diverged
sequence, SIFT can offer better prediction than a substitution
matrix.

Our results indicate that given a set of substitutions
to assay, those substitutions predicted to be deleterious
by SIFT will yield a greater proportion of affected
phenotypes compared with substitutions judged
to be nonconservative by substitution scoring matrices.
Some of the substitutions predicted to be tolerated
by SIFT may in fact be deleterious; the LacI test case
showed that SIFT is unable to identify residues that
are important for function but have not been conserved
throughout the family. Positions predicted to be
intolerant by SIFT but which tolerate substitutions
according to the functional assay might be involved in
an unknown function that the assay does not detect.
In LacI, many of the conserved positions that tolerate
substitutions in the DNA- and sugar-binding assays occur
together at an exposed face in the C-terminal subdomain.
Because these residues have been conserved
among diverged but functionally related sequences,
this indicates that this C-terminal face may participate
in an as yet unknown interaction. Substitutions at conserved
positions, which still behave as wild-type in
functional assays, nevertheless may be involved in a
function in vivo for which the existing assays do not
test.
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ttity of scores in log-odds substitution 



scorirjfe ma dices are negative to prevent sequence 
alignments from extending spuriously in database 
searchdngj {Al schul 1991). For example, on average, 14 
out o|the 19 possible substitutions for a given amino 
acid rave; negative scores in BLOSUM62 and are clas- 
sified |s rfonconseryative changes (Cargill et al. 2000). 
If: furJ|ti^nal assays are performed on substitutions 
deem|d noni onseryative by substitution scoring ma- 
trjcesjUany c I the deleterious mutants will be detected 
simpl| because the matrix is dominated by negative 
entriej , This benefit of characterizing most of the del- 
etjetloj s subs itutions when using matrix predictions 
rather] than sift's comes at the cost of assaying sub- 
stituti Hnstha : do not affect phenotype. If there are few 
variarJ s tio tt aractejrize, or it is important to not miss 
any vj Wants :hat alter protein function, then charac- 
terizing all substitutions or those with negative scores 
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in a substitution matrix is a good stiategy. However, . . ji 
large-scale projects in which many riaissjense mutatlops 
are generated, it is more important to minim jze th 
number of unnecessary experiments rather tjian ,:\y 
identify all deleterious substitutional Hence, si£t w 11 
be more efficient than substitution scoring matrices for 
large-scale projects. Preliminary datja show thai Sift 
predicts 69% of more than 3500 di 
stitutions to be deleterious, indicati 
be suitable for automated predlctlc 
scale (data not shown). [ 

Linkage disequilibrium and association studies
make use of polymorphism data to find genetic factors
that may cause or increase risk for a disease. Among the
markers identified in association or linkage
disequilibrium studies, SIFT can predict which markers that
result in an amino acid change may themselves be the
cause of a deleterious effect on the protein. Because of
the amount of polymorphism data needed to conduct
linkage disequilibrium and association studies, a
plethora of missense mutations are being identified,
and some of the missense variants themselves are likely
to be involved in disease. Approximately half of the
gene lesions known to be responsible for human
inherited disease are due to amino acid substitutions
(Cooper et al. 1998), showing that amino acid
substitutions play a large role in diseases. In a study on
nonsynonymous SNPs in proteins for which structures
were known (Sunyaev et al. 2000), 43% of the missense
variants mapped to structurally and functionally
important regions, and it was suggested that a large
fraction of nonsynonymous SNPs can have strong effects
on the encoded proteins. Sunyaev et al. (2000) studied
only 86 nonsynonymous SNPs because they relied on
structure for their analysis. Because SIFT uses
sequence homology rather than protein structure, it
could potentially analyze a larger number of
nonsynonymous SNPs than studies based on protein structure
alone. In HGBASE (Brookes et al. 2000), a public
database of human sequence variants that may or may not
be involved in disease, there were 20,182 gene variants
of which 3146 caused amino acid substitutions as of
January 2000. It has been predicted that there will
eventually be ~200,000 coding sequence variants
(Brookes et al. 2000), which suggests there may
eventually be 30,000 missense variants in this databank
alone. The sheer magnitude of missense variants
renders it unfeasible to test all of these substitutions for
their effects on the proteins for which they code.
Because SIFT is an automated, relatively quick
procedure, it can be used to predict which missense variants
are likely to be deleterious and thus home in on which
ones are likely candidates for disease and which
proteins should be subjected to further investigation.

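
The projection of roughly 30,000 missense variants can be checked with a line of arithmetic. This is only a sketch: it reads the HGBASE gene-variant count as 20,182 and assumes that the missense fraction observed in January 2000 carries over unchanged to the projected total.

$$\frac{3146}{20{,}182} \approx 0.156, \qquad 0.156 \times 200{,}000 \approx 31{,}000$$

The result is of the same order as the 30,000 missense variants quoted above.
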
SIFT can also be applied to large-scale,
reverse-genetic projects in which mutations are introduced
randomly in the genome of an experimental organism,
altered genes are identified, and then the phenotype
for the resulting mutants ascertained (Bentley et al.
2000; Chen et al. 2000; McCallum et al. 2000). A
majority of the mutations generated in the coding regions
by the chemical mutagens used in these large-scale
projects cause amino acid substitutions. The
rate-limiting step may be deciding which mutants to pursue
for further study. The same dilemma arises when a
gene is targeted for random mutagenesis. If the
phenotype of a deleterious mutant is unknown or difficult to
assay, SIFT can be used as a guide for which mutations
are likely to be deleterious to protein function and are
worth pursuing.

METHODS

Obtaining Sequences Related to a Protein of Interest
SIFT starts with a query protein sequence. Relying on the
observation that proteins in the same subfamily have high
conservation in conserved regions (Nevill-Manning et al.
1997), it selects sequences that are similar to the query
sequence by adding the most similar sequence from a database
of protein sequences iteratively to the growing collection
until conservation in the conserved regions decreases. We use
$R_c$ to measure the conservation at position $c$, where
$R_c = \log_2 20 + \sum_a p_{ca} \log_2 p_{ca}$ and $p_{ca}$
is the frequency at which amino acid $a$ appears in position
$c$ (Schneider et al. 1986).

PSI-BLAST (Altschul et al. 1997) with parameters
-e 0.0001 and -h 0.002 is run for four iterations to collect a
pool of sequences similar to the query from a protein
sequence database, such as SWISS-PROT (Bairoch and Apweiler
2000). The sequences found by PSI-BLAST are then grouped
together if they are >90% identical in the regions aligned by
PSI-BLAST, and a consensus sequence is made for each group
by choosing the amino acid that occurs most frequently at
each position. Next, the motif-finding algorithm MOTIF
(Smith et al. 1990; Henikoff and Henikoff 1991) is used to find
conserved regions among the query sequence and consensus
sequences that were derived from at least two sequences.
Consensus sequences that were derived from only one sequence
are avoided in the motif-finding step to increase efficiency.
Once the conserved regions in the query sequence have been
identified by MOTIF, these regions are extracted from the
sequences aligned by PSI-BLAST. The conserved regions are
grouped together if they are >90% identical, and a consensus
sequence is made for each group. The conserved regions of the
query sequence and those consensus sequences >90%
identical are converted to a PSI-BLAST checkpoint file. This
checkpoint file is the seed to which additional sequences will be
added.
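
The conservation measure and the per-group consensus described above can be sketched in a few lines. This is a minimal illustration, not the SIFT implementation: the function names are hypothetical, gap characters are simply ignored, and plain (unweighted) residue frequencies stand in for $p_{ca}$.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_conservation(column):
    """R_c = log2(20) + sum_a p_ca * log2(p_ca) for one alignment column."""
    # Assumption: anything that is not one of the 20 amino acids (e.g. a gap) is skipped.
    residues = [aa for aa in column if aa in AMINO_ACIDS]
    if not residues:
        return 0.0
    total = len(residues)
    entropy_term = sum(
        (n / total) * math.log2(n / total) for n in Counter(residues).values()
    )
    return math.log2(20) + entropy_term

def consensus(sequences):
    """Most frequent residue at each position of equal-length aligned sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*sequences))

# An invariant column is maximally conserved; a column containing all
# 20 amino acids equally often carries no conservation.
print(column_conservation("AAAA"))
print(column_conservation(AMINO_ACIDS))
print(consensus(["ACD", "ACE", "GCE"]))
```

An invariant column scores log2(20), about 4.32 bits, which is the upper bound of $R_c$.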

The seed checkpoint file is given to PSI-BLAST to search
among the remaining conserved regions of the consensus
sequences not included in the seed checkpoint file. The top hit
is added to the alignment corresponding to the seed
checkpoint file, and the conservation over the entire alignment of
conserved regions, $\sum_c R_c$, is calculated. If $\sum_c R_c$
is greater than or equal to the $\sum_c R_c$ of the seed
checkpoint file, then conservation has not decreased by adding
this consensus sequence. Therefore, this consensus sequence is
added to the alignment, and the checkpoint file is rebuilt.
The process repeats: The checkpoint file is used as a query
for PSI-BLAST to search among the conserved regions of the
remaining consensus sequences, and the decision to add the
highest scoring hit depends on whether the hit does not
decrease conservation. After the process terminates, the
sequences found in the initial PSI-BLAST search that
correspond to the consensus sequences in the final checkpoint
file are used in subsequent steps of prediction. These
sequences tend to align globally with the query sequence and
usually belong to a small clade within the query protein's
family.


Position-Specific Probability Estimation
The multiple alignment of the query sequence with sequences
that were chosen as described in the previous paragraph is
extracted from the initial PSI-BLAST results. PSI-BLAST
alignments have been shown to be fairly accurate and long in
comparison to other sequence alignment tools (Sauder and
Dunbrack 2000). The alignment is converted into a
position-specific scoring matrix (PSSM; Gribskov et al. 1987). A
PSSM is an $l \times 20$ matrix where $l$ is the length of the
protein sequence. Each matrix entry, $p_{ca}$, is the
probability of amino acid $a$ at position $c$ of the protein,
where $c$ ranges from 1 to $l$ and $a$ is any one of the 20
amino acids. The probability of amino acid $a$ appearing at
position $c$ is estimated by the following general formula
(Henikoff and Henikoff 1996):

$$p_{ca} = \frac{N_c}{N_c + B_c}\, g_{ca} + \frac{B_c}{N_c + B_c}\, f_{ca}$$

$N_c$ is set to the total number of sequences in the alignment
and $g_{ca}$ is the sequence-weighted frequency at which amino
acid $a$ appears at position $c$ in the alignment (Henikoff
and Henikoff 1994). If an alignment position includes gaps,
they are distributed among the amino acids as follows: If
$g_{c-}$ is the frequency of gaps observed at position $c$,
then for all 20 amino acids $a$, the count $g_{ca}$ is
incremented by $g_{c-}/20$.

Because the observed sequences similar to the query are
only those available in the sequence databases searched,
pseudocounts $f_{ca}$ are added to the observed counts for each
amino acid in each column of the alignment (Henikoff and
Henikoff 1996). $f_{ca}$ is calculated from a 13-component
Dirichlet mixture (Sjolander et al. 1996), and $B_c$ is the
total number of pseudocounts. Thus, $p_{ca}$ is a weighted
average of the observed amino acid frequencies in the
alignment and estimated unobserved frequencies. For SIFT, we
wanted to give pseudocounts more weight relative to observed
counts when the amino acids present at a position are more
diverse. To achieve this, we chose $B_c$ to be an exponential
function of a weighted diversity measure, $D_c$. Let the
reference amino acid in a position be the amino acid that
appears with the highest frequency, and let $r_a$ be the rank
that amino acid $a$ has in an ordered list from the highest to
lowest score from a substitution matrix for the reference
amino acid. (BLOSUM62 is used to compute $r_a$, but other
substitution matrices should give similar results.) So
$r_a = 1$ for the reference amino acid. Then
$D_c = \sum_a (r_a \cdot g_{ca})$. At an invariant position we
set $B_c = 0$; otherwise $B_c = \exp(D_c)$.
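
To make the weighting concrete, the estimate for a single column can be sketched as follows. This is an illustration only: the function name, the three-letter toy alphabet, and every number in the example are made up, and the pseudocount frequencies and BLOSUM62 ranks are passed in as plain dictionaries rather than derived from a Dirichlet mixture or a real matrix.

```python
import math

def estimate_column(g, f, rank, n_seqs):
    """Return p_ca for one column from observed and pseudocount frequencies.

    g, f   : dicts mapping amino acid -> frequency (each summing to 1)
    rank   : dict mapping amino acid -> rank r_a (1 for the reference residue)
    n_seqs : N_c, the number of sequences in the alignment
    """
    observed = [a for a in g if g[a] > 0]
    if len(observed) == 1:
        b = 0.0  # invariant position: B_c = 0
    else:
        d = sum(rank[a] * g[a] for a in g)  # D_c = sum_a r_a * g_ca
        b = math.exp(d)                     # B_c = exp(D_c)
    # Weighted average of observed and pseudocount frequencies.
    return {a: (n_seqs * g[a] + b * f[a]) / (n_seqs + b) for a in g}

# Toy column: mostly A, some S, no W observed (illustrative values).
g = {"A": 0.8, "S": 0.2, "W": 0.0}
f = {"A": 0.5, "S": 0.4, "W": 0.1}
rank = {"A": 1, "S": 2, "W": 3}
p = estimate_column(g, f, rank, n_seqs=10)
print(p)
```

Because the pseudocounts contribute to every residue, the unobserved W receives a small non-zero probability, while an invariant column is returned unchanged.
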
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Prediction 

To automate SIFT, we wanted to apply one cutoff to all
columns of the PSSM calculated in the previous section. In the
most diverse alignment column possible, all 20 amino acids
might appear in a position with equal probability,
$p_{ca} = 1/20$, whereas in a conserved position only two
amino acids may appear, one with probability 0.05 and the
other with 0.95. But if, for example, 0.05 were chosen as a
cutoff for $p_{ca}$, so that a substitution to amino acid $a$
in column $c$ is predicted deleterious if $p_{ca} \le 0.05$,
then substitution to any amino acid would be predicted as
deleterious in the column where $p_{ca} = 0.05$ for all $a$,
when this is obviously a very tolerant position. A cutoff
therefore cannot be applied to $p_{ca}$ alone. Instead,
$p_{ca}$ is normalized on the consensus amino acid in each
column, which is the amino acid with the highest $p_{ca}$. The
consensus amino acid may be different from the reference
amino acid defined in the previous section because
pseudocounts are included.

Substitutions with normalized probabilities <0.05 are
predicted to be deleterious; those ≥0.05 are predicted to be
tolerated. This cutoff was chosen for the LacI data set and
then applied to the bacteriophage T4 lysozyme and HIV
protease data sets. A user may decide from examining the
probability distribution whether a substitution with a
probability near the cutoff should be reclassified as
deleterious if predicted tolerated, or vice versa.

Because the protein alignment may not extend to the ends of
the protein, and N- and C-terminal positions may contain
little sequence information, we arbitrarily chose to predict
on positions where >50% of the sequences were represented.
However, probabilities and predictions are returned for every
position; a user can judge whether enough sequences are
represented at the position to rely on the prediction.
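
The normalization and cutoff described above can be sketched for a single column as follows. The function name and the example probabilities are illustrative assumptions; only the 0.05 cutoff and the normalization on the highest-probability (consensus) amino acid come from the text.

```python
def predict_column(p_column, cutoff=0.05):
    """Classify each amino acid in one PSSM column by normalized probability."""
    p_consensus = max(p_column.values())  # probability of the consensus residue
    result = {}
    for aa, p in p_column.items():
        normalized = p / p_consensus
        result[aa] = "deleterious" if normalized < cutoff else "tolerated"
    return result

# A maximally diverse column: every residue has p = 0.05, so every
# normalized probability is 1 and all substitutions are tolerated.
uniform = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}
print(set(predict_column(uniform).values()))

# A conserved column with made-up probabilities.
conserved = {"L": 0.90, "I": 0.08, "P": 0.02}
print(predict_column(conserved))
```

Normalizing first is what lets one cutoff serve both column types: the uniform column is no longer flagged, while the rare residue in the conserved column still is.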

Availability

A sequence, related sequences, or a sequence alignment can
be submitted for SIFT prediction at the BLOCKS Web site:
http://blocks.fhcrc.org/~pauline/SIFT.html. If a sequence is
submitted, related sequences are returned along with
SIFT predictions so that the user can manually refine the
alignment and resubmit for prediction.
LacI, lysozyme, and HIV-1 protease alignments and results from
BLOSUM45 and BLOSUM80 predictions can also be obtained
at this site.
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